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Abstract Mixtures of factor analyzers are becoming more and more popular in the area of 
model based clustering of high-dimensional data. According to the likelihood approach in 
C I data modeling, it is well known that the unconstrained log-likelihood function may present 

spurious maxima and singularities and this is due to specific patterns of the estimated co- 
variance structure, when their determinant approaches 0. To reduce such drawbacks, in this 
paper we introduce a procedure for the parameter estimation of mixtures of factor analyzers, 
which maximizes the likelihood function in a constrained parameter space. We then analyze 
^t"! , and measure its performance, compared to the usual non-constrained approach, via some 

simulations and applications to real data sets. 

Keywords Constrained estimation ■ Factor Analyzers Modeling ■ Mixture Models ■ 
', Model-Based Clustering. 

1 Introduction and motivation 

>• i 

I Finite mixture distributions have been receiving a growing interest in statistical modeling. 

. Their central role is mainly due to their double nature: they combine the flexibility of non- 

ly^ ' parametric models with the strong and useful mathematical properties of parametric models. 

According to this approach, when we know that a sample of observations has been drawn 
from different populations, we assume a specific distributional form in each of the under- 
, lying populations. The purpose is to decompose the sample into its mixture components, 

fT^ ' which, for quantitative data, are usually modeled as a multivariate Gaussian distribution, and 

to estimate parameters. The assumption of underlying normality, besides the elegant analytic 
properties, allows also to employ the EM algorithm for the ML estimation of the parame- 
ters. On the other side, when considering a large number of observed variables, Gaussian 
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mixture models can provide an over-parameterized solution as, besides the mixing weights, 
it is required to estimate th e mean vector and the covariance matrix for each component 
jPeel and McLachlanl . [2OO0l) . As a consequence, we observe at the same time an undue load 
of computationally intensive procedures for the estimation. 

This is the reason why a number of strategies have been introduced in the literature 
to avoid over-parameterized solutions. Among the v arious proposal, some authors dev el- 
oped methodologies for variable selec t ion (s ee, f.i., Liu et al\ ^2003') and Hoff ('2005|) in 
the Bayesian framework. |Pan and ShenI (l2007h and lRafterv and Dean (.2006.) in the frequen- 
tist one). They further motivate their approach from the observation that the presence of 
non-informative variables can be strongly misleading for some clustering methods. With 
the sa me purpose of parsimony, but a completely different approach, [Banfield and Ra ftervl 
lll993h devised a methodology to identify common patterns among the component-covariance 
matrices; their p roposal arose a great att e ntion in th e literature. Along a slig htly different 
line of thinking, iGhahramani and HiltonI ( Il997h and iMcLachlan et al\ ( 1200 3h proposed to 
employ latent variables to perform dimensional reduction in each component, starting from 
the consideration that in many phenomena some few unobserved features could be explained 
by the many observed ones. 

In this paper we address mixtures of factor analyzers by assuming that the data have 
been generated by a linear factor model with latent variables modeled as Gaussian mix- 
tures. Our purpose is to improve the performances of the EM algorithm, by facing with 
some of its issues and giving practical recipes to overcome them. It is well known that the 
EM algorithm generates a sequence of estimates, starting from an initial guess, so that the 
corresponding sequence of the log-likelihood values is not decreasing. However, the con- 
vergence toward the MLE is not guaranteed, because the log-likelihood is unbounded and 
presents local maxima. Another critical point is that the parameter estimates as well as the 
convergence of the whole estima tion process may be affected by the starting values (see, f.i., 
iMcLachlan and KrishnanI j2007h ') so that the final estimate crucially depends on the initial 
guess. This issue has been investigated by man y authors, starting from the seminal paper of 
iRedner and Walked (11984'). Along the lines of Ilngrassiail2004l) . in this paper we introduce 
and implement a procedure for the parameters estimation of mixtures of factor analyzers, 
which maximizes the likelihood function in a constrained parameter space, having no sin- 
gularities and a reduced number of spurious local maxima. We then analyze and measure its 
performance, compared to the usual non-constrained approach. 

We have organized the rest of the paper as follows. In Section [2] we summarize main 
ideas about Gaussian Mixtures of Factor Analyzer model; in Section [3] we provide fairly 
extensive notes conce rning the likeliho od function and the AECM algorithm. Some well 
known considerations ( lHathawavL[l985h related to spurious maximizers and singularities in 
the EM algorithm are recalled in Section |4] and motivate our proposal to introduce con- 
straints on factor analyzers. Further, we give a detailed methodology to implement such 
constraints into the EM algorithm. In Section |6] we show and discuss the improved perfor- 
mance of our procedure, on the ground of some numerical results based on both simulated 
and real data. Section|7]contains concluding notes and provides ideas for future research. 



2 The Gaussian Mixture of Factor analyzers 

Within the Gaussian Mixture (GM) model-based approach to density estimation and cluster- 
ing, the density of the J-dimensional random variable X of interest is modelled as a mixture 
of a number, say G, of multivariate normal densities in some unknown proportions TTi , . . . Kq. 
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That is, each data point is taken to be a reahzation of the mixture probabiHty density func- 
tion, 

f{x-e) = '^7t,u^-n^,Eg) (1) 

where (j)ij{x;fi,E) denotes the <i-variate normal density function with mean fi and covariance 
matrix E. Here the vector 9aM{d,G) of unknown parameters consists of the (G— 1) mixing 
proportions Tig, the Gxd elements of the component means jXg, and the \Gd{d+ 1) distinct 
elements of the component-covariance matrices Eg. Therefore, the G-component normal 
mixture model (O with unrestricted component-covariance matrices is a highly parametrized 
model. We crucially need some method for parsimonious parametrization of the matrices 
Eg, because they requires 0{cfi) parameters. Among the various proposals for dimensional- 
ity reduction, we are interested here in considering Mixtures of Gaussian Factor Analyzers 
(MGFA), which allows to explain data by explicitly modeling correlations between vari- 
ables in multivariate observations. We postulate a finite mixture of linear sub-models for the 
distribution of the full observation vector X, given the (unobservable) factors U. That is we 
can provide a local dimensionality reduction method by assuming that the distribution of the 
observation X,- can be given as 

X/ = jig +Ag\iig + e,g with probability Ttg{g=\,...,G) for / = 1 , . . . , w, (2) 

where Ag is a d x q matrix of factor loadings, the factors Uig, ■ ■ ■ ,U„g are ^(0,1^) dis- 
tributed independently of the errors e,^, which are independently ^(0, f distributed, and 
Wg is a d X d diagonal matrix [g = 1, . . . , G). We suppose that q < d, which means that q 
unobservable factors are jointly explaining the d observable features of the statistical units. 
Under these assumptions, the mixture of factor analyzers model is given by lO, where the 
g-th. component-covariance matrix Eg has the form 

Eg=AgA'g + ^g (g=l,...,G). (3) 

The parameter vector dMCFA{d,q, G) now consists of the elements of the component means 
IJ,g, the Ag, and the f along with the mixing proportions Kg [g = I, . . . ,G — I), on putting 
7lc = I — L/ii' ^g- Note that in the case oi q> 1, there is an infinity of choices for Ag, since 
model (|2]l is still satisfied if we replace Ag by A^H', where H is any orthogonal matrix of 
order q. As q(q — l)/2 constraints are needed for Ag to be uniquely defined, the number of 
free parameters, for each component of the mixture, is 

dq + d — 2'?('? ~ !)• 

Comparing the two approaches and willing now to measure the gained parsimony when 
we use mixtures of factor analyzers, with respect to the more usual gaussian mixtures, and 
denoting by \ 9covCMid,G)\ and |0coi'MCFa('^,'?,G)|, the number of the estimated parameters 
for the covariance matrices in the GM and MGFA models, respectively, we have to choose 
values of q such that the following quantity P is positive 

G 1 

P= \0CovCM{d,G)\- \ QcovMCFA{d,q,G) \ = —d{d + \)-G[dq-d+ ^lil- 1)] 

i.e.: 

P=j[{d-qf-{d + q)]. 
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This is the only requirement for parsimony. Now, we can express the relative reduction 
RR{d, q, G) =RR{d, q) given by 

X _ dcovCM {d,G)\-\ dcovCMFA {d,q,G)\ _ jd - q)^ -{d + q) 

\eco.cM{d,G)\ ~ d{d+\) ■ 

In Table[T]we report the relative reduction, in term of lower number of estimated parameters 
for the covariance matrices in the MGFA models, with respect to the GM models. 



Table 1 Relative reduction RR{d,q) 
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The relative reduction represents the extent to which the factor model offers a simpler 
interpretation for the behaviour of x than the alternative assumption given by the gaussian 
mixture model. 



3 The likelihood function and the EM algorithm for MGFA 

In this section we summarize the main steps of the EM algorithm for mixtures of Factor 
analyzers, see e.g. iMcLachlan and Peell bOOOh for details. 

Let X = (xi, . . . ,x„) be a sample of size n from density ([TJ, and let x,- (; = 1, . . . ,n) 

denotes the realization of X, in For given data X, parameters in ([!} can be estimated 

according to the likelihood approach via the EM algorithm, where the likelihood function is 
given by: 

L(0;X) = fl I X; U^r,lig,Eg) = fl I E <?'rf(x,;M,,Aj,f ,) \ , 

where we set Eg = AgA'^ + f j, {g= 1, . . . , G). Consider the augmented data {(x,',u,g,z,), i = 
1 , . . . , «}, where z,- = (za , . . . , Zig)' , with Zig = 1 if x,- comes from the g-th population and 
Zig = otherwise. Then, the complete-data likelihood function can be written in the form: 

L,(e;X) = fin ['^d (x,-|u;;M,,Aj,f j) U»is)^sT' ■ (4) 

In particular, due to the factor structure of the model, see iMeng and van Dv3 ( Il997t) , we 
have to consider the alternating expectation-conditional maximization (AECM) algorithm. 
Such a procedure is an extension of the EM algorithm that uses different specifications of 
missing data at each stage. The idea is to partition d = {Q\ , St)' in such a way that L(0;X) 

is easy to maximize for di given 02 and vice versa. Then, we can iterate between these two 
conditional maximizations until convergence. In this case Qi = {Kg, fig, g = I, . . . ,G} where 
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the missing data are the unobsei^ed group labels Z = (z'j , . . . , zj, ) , and the second part of the 

parameters vector is given by 62 = {{■^gj^g)^ S = where the missing data are the 

group labels Z and the unobserved latent factors U = (Un, . . . ,U„g)- Hence, the application 
of the AECM algorithm consists of two cycles, and there is one E-step and one CM-step 
alternatively considering 9i and 62 in each pair of cycles. 

First Cycle. Here it is 0i = {Kg, fig, g = 1, . . . ,G} where the missing data are the unob- 
served group labels Z = (z'j , . . . , z', ) . The complete data likelihood is 

= nn [^'' (x-Mg,^,) ^sT" ■ (5) 
/=ig=i 

The E-step on the first cycle on the (A:-|-l)-th iteration requires the calculation of Qi{di',d^^^) - 
Eg(i) |X} which is the expected complete-data log-likelihood given the data X and 

using the current estimate 0'*' for 6. In practice it requires calculating Eg(t){Z/^|X} and 
usual computations show that this step is achieved by replacing each Zig by its current con- 
ditional expectation given the observed data x/, that is we replace zig by where 



On the M-step, the maximization of this complete-data log-likelihood yields 

yn Jk+l) 

jk+i) _ U^i ^ig 



^ n 
i,ik+i) _ _}_y (k+i) 



where nf^^'^ = E'Liz[^^''. According to notation in iMcLachlan and Peell ll2000h . we set 

e(m/2)^(0(*+l)'^0W'y^ 



Second Cycle. Here it is do — {^g, g — 1, . . . , G} = {(Ag, ^g), g = 1, . . . ,G} where the 
missing data are the unobserved group labels Z and the latent factors U. Therefore, the 
complete data likelihood is 

LM) =flfl[<l>d Ui\nig■,^if+'\Eg) c^, iuig) K^'^] 

i=\g=\ 

= fin [<?.(x/|u,;,;M^'\Ag,fj).^,(u,.,)4'+^']"\ (7) 



where 



W (x,.|u,,;Mf+",A„f = ^-i^exp|-i(x,-M^') -A,u,;,)"f-i(x,-Mf+'' -A^u,;,)} . 



g 

1 

(^2nYl2-^\ 2 
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Now the complete data log-likelihood is given by 

nd I n c 

(02) = - Y ln2;r + ^ In ;r, + - ^ In | f - ' | 

g—\ /— Ig— 1 



E/«tr{ -A,u,,)(x,-Mr'^ -A,u,.,)"f-'}. (8) 



Some algebras lead to the following estimate of {{Ag, f i,), g = I,. . . , G}: 

-F, = diag{sr"-A,r«sr')}. 



where we set 

sr" = (iAr'))i:4r"(x,-Mr'')(x,-Mr^')' 

!=1 

W = I, - rf'A W + (X,- - /I,) (X, - ^i^)'7f . 

Hence the maximum likelihood estimates Ag and "Fg for A and f can be obtained by alter- 
natively computing the update estimates A^ and f g , by 

A^=sf^'^Yf[@i\' and f g+ = diag {8^" - A+y^S^ " } , (9) 

and, from the latter, evaluating the update estimates 7+ and by 

7+=Ag(AgA; + fg)-i and 0+ = I, - y^Ag + y^sf +''7g, (10) 

iterating these two steps until convergence on Ag and 'Pg, so giving a[*^'^ and f j,*^^'' . 

In summary, the procedure can be described as follows. For a given initial random clus- 
tering z'"', on the {k+1) —th iteration, the algorithm carries out the following steps, for 
g=l,...,G: 

1. Compute and consequently obtain Kg''^^\ fif^^\ nf^^^ and Sg*^^''; 

2. Set a starting value for Ag and f g from Sg*^^'' ; 

3. Repeat the following steps, until convergence on Ag and "Fg! 

(a) Compute 7+ and 0+ from ( fTOt : 

(b) Set7g^7+ and0g^0+; 

(c) Compute A+ ^ S^*+'V^(0g') and f + ^ diag{sf+'^ -A+7gSf+''}; 



(d) Set Ag ^ A+ and f g ^ f ^ 



To completely describe the algorithm, here we give more details on how to specify the 
starting values for Ag and f g from Sg*^'\ as it is needed in Step 2. 

Starting from the eigen-decomposition of Sg'^^^', say Sg'^^'' = AgBgAg, computed on 
the base of z,-*^'', the main idea is that Ag has to synthesize the "more important" relations 
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between the d observed features, see iMcNicholas and Murphvl (l2008h . Then, looking at the 
equality Eg = rgP'g + f g, the initial values of Ag were set as 

where dj is the jth largest eigenvalue of S j^^'^ and a,y is the ith element of the corresponding 
eigenvector aj (the 7th column in Ai,), for i € {1,2,...,/?} and j E {1,2, . . . ,q}. Finally the 

Wg matiices can be initialized by the position f g = diaglSg*^^'' — AgA^}. 
4 Likelihood maximization in constrained parametric spaces 

Properties of maximum likelihood estimation for normal mixture models have been deeply 
investigated. It is well known that ^{0) is unbounded on and may present many local 
maxima. Day (1969) was perhaps the first noting that any small number of sample points, 
grouped sufficiently close together, can give raise to spurious maximizers, corresponding 
to parameters points with greatly differing comp onent standard de viation. To overcome this 
issue and to prevent J^{9) from singulaiities, iHathawavl lll985h proposed a constrained 
maximum likelihood formulation for mixtures of univariate normal distributions, suggesting 
a natural extension to the multivariate case. Let c 6 (0, 1], then the following constraints 

min AfZftZy') >c (12) 
\<h^j<k ■' 

on the eigenvalues A of Ei,Ej^ leads to properly defined, scale-equivariant, consistent ML- 
estimators for the mixture-of-normal case, see Hennig (2004). It is easy to show that a suf- 
ficient condition for il2l is 

a<Xig<b, i=l,...,d; g=l,...,G (13) 

where A/g de notes the ;' th eig envalue of Eg i.e. A/g = A,(Zg), and for a,b £ such that 
a/b > c, see [ingrassia' ('2004|). Differently from (112b . condition dlSI l can be easily imple- 
mented in any optimization algorithm. Let us consider the constrained parameter space 0c 
of©: 

0c={{Ku...,7Zc,^i^,...,^ic,Eu■■■,Ea)e MMi+rf+(rf^+^)/2] , 

7ig>0,7ii-\ h7ic=^l,a<?iig<b, g=l,...,G i=l,...,d}. (14) 

Due to the structure of the covariance matrix Eg given in bound in ( 1131 ) yields 

KuMgA'g + ^g)>a and K.MsK + 'I'g) < b, g=l..---,G (15) 

where A„in(-) and Amax(-) denote the smallest and the largest eigenvalue of (•) respectively. 
Since AgAg and fg are symmetric and positive definite, then it results: 

A„,in(AgA; + fg) > A„in(AgA;)+A„in(fg) >a (16) 

A,™x(AgA; + fg) <A,™x(AgA;)+A™ax(f,?) (17) 
see lLutkepoh]|(ll996l^ . Moreover, being f g a diagonal matrix, then 

Kini^ g) = min Xj/ig and A,„ax(f g) = max i/Z/g, (18) 



8 



Francesca Greselin, Salvatore Ingrassia 



where 1//,^ denotes the i-th diagonal entry of the matrix Wg. 

Concerning the square d xd matrix AjjA^ {g = 1,...,G), we can get its eigenvalue 
decomposition, i.e. we can find Ag and Fg such that 

^gK = rg^sr'g (19) 

where Fg is the orthonormal matrix whose rows are the eigenvectors of AgA^ and Ag = 
diag(5ig, . . . , 5(jg) is the diagonal matrix of the eigenvalues of AgA'^, sorted in non increasing 
order, i.e. 5ig > 52g > . . . > 5qg > 0, and 5(^+i)g = • • • = S^g = 0. 

Now, we can apply the singular value decomposition to the d x q rectangular matrix 
Ag, so giving Ag = UgDgVg, where \Jg is a. d x d unitary matrix (i.e., such that UgUg = 
Irf) and Dg is a d X rectangular diagonal matrix with q nonnegative real numbers on the 
diagonal, known as singular values, and Vg is a ^ x ^ unitary matrix. The d columns of U 
and the cj columns of V are called the left singular vectors and right singular vectors of Ag, 
respectively. Now we have that 

AgA; = (UgDgv;) (VgD^u;) = UgDgi,p;u; = UgOgO^u; (20) 

and equating | |19| | and l l2Qb we get Fg = Ug and Ag = DgD'^, that is 

diag(5ig, . . . , Sqg) = diag{djg, . . .,dlg) . (21) 

with d\g > dig > ■ ■ ■ > dgg > 0. In particular, it is known that only the first q values of Dg 
are non negative, and the remaining d — q terms are null. Thus it results 

An,ax(AgA;)=4,. (22) 

Supposing now to choose a value for the upper bound b in such a way that b > f ,g for 
g = 1 , . . . , G and = \ ,. . . ,q, then constraints l ll6b and illl are satisfied when 

i=l,...,d (23) 
i=l,...,q (24) 

i = q+l,...,d (25) 

for g = 1 , . . . , G. In particular, we remark that condition (123) reduces to f^g > a for = 
iq+\),...,d. 



dig + Xffig > a 

dig < 



5 Constraints on the covariance matrix for factor analyzers 

The two-fold (eigenvalue and singular value) decomposition of the Ag presented above, sug- 
gests how to modify the EM algorithm in such a way that the eigenvalues of the covariances 
Eg (for g = 1 , . . . , G) are confined into suitable ranges. To this aim we have to implement 
constraints (|23j, (|24} and ( l25l >. 

We proceed as follows on the l)th iteration: 

1. Decompose Ag according to the singular value decomposition as Ag = UgDgVg; 

2. Compute the squared singular values (Jj^, . . . , d^^) of Ag; 

3. Create a copy D* of Dg'^^'' and a copy f* of f g*^'^ 

4. For i = 1 to q, if df^ + 1//;^+'' < a, then if a - 1//,^*+'^ > set dig ^ sja^^^f^ else 
dig Y^into D*; 
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6, 



5 




.(^•+1) 



into D* 



else dig into D*; 

7. For I = ^ + 1 to d, if v/;^+'' > b then set v/,^f+'^ ^ into f *; 

8. SetAf+'Vu^D;V^; 

9. Setff+'Vf*. 
10. Stop. ' 

It is important to remark that the resulting EM algorithm is monotone, once the initial 
guess, say E1, satisfies the constraints. Further, as shown in the case of gaussian mixtures in 
Ilngrassia and Roccil ( |2007|) . the maximization of the complete loglikelihood is guaranteed. 
From the other side, it is apparent that the above recipes require some a priori information 
on the covariance structure of the mixture, throughout the bounds a and b. 

6 Numerical studies 

In this section we present numerical studies, based on both simulated and real data sets, 
in order to show the performance of the constrained EM algorithm with respect to uncon- 
strained approaches. 

6. 1 Artificial data 

We consider here three mixtures of G components of J-variate normal distributions, for 
different values of the parameter Oq. First, we point out that the point of local maximum 
corresponding to the consistent estimator 6*, has been chosen to be the limit of the EM 
algorithm using the true parameter Oq as initial estimate, i.e. considering the true classifica- 
tion. In other words, we set Zig = 1 if the ;'th unit comes from the gth component and Zig = 
otherwise. In the following, such estimate will be referred to as the right maximum of the 
likelihood function. 

To begin with, we generate a set of 100 different random initial clusterings to initialize 
the algorithm at each run. To this aim, for a fixed number G of components of the mixture, we 
draw each time a set of random starting values for the zig from the multinomial distribution 
with values in (1,2, ... ,G) with parameters {pi,P2, ■ ■ - iPg) = {^/G,l/G,. . . A/G). Then 
we run a hundred times both the unconstrained and the constrained AECM algorithms (for 
different values of the constraints a, b) using the same set of initial clusterings in both cases. 
The initial values for the elements of and fg can be obtained as described at the end of 
Section[3]from the eigen-decomposition of S^, and the algorithms run until convergence or 
it reaches the fixed maximum number of iterations. 

The stopping criterion is based on the Aitken acceleration procedure (lAitkeniri926h , to 
estimate the asymptotic maximum of the log-likelihood at each iteration of the EM algorithm 
(in such a way, a decision can be made regarding whether or not the algorithm reaches 
convergence; that is, whether or not the log-likelihood is sufficiently close to its estimated 
asymptotic value). The Aitken acceleration at iteration k is given by 



^ik+\)__^(k) 



a' 



10 



Francesca Greselin, Salvatore Ingrassia 



where ^'■''K and if('^-i) are the log-hkelihood values from iterations k+ I, k, and 

k—l, respectively. Then, the asymptotic estimate of the log-likelihood at iteration 1 is 
given by 

1 — fl'*' V / ' 

see iBohning et oZl l ll994l) . In our analyses, the algorithms stop when Jfif^^'' — < e, 
with e = 0.001. Programs have been written in the R language; the different cases and the 
obtained results are described below. 

Mixture \: G = 'i,d = 6,q = 2,N = 150. 

The sample has been generated with weights a = (0.3,0.4,0.3)' according to the fol- 
lowing parameters: 



^ii = (0,0,0,0,0,0)' 
M2 = (5,5,5,5,5,5)' 
M3 = (10,10,10,10,10,10)' 
/ 0.50 1.00 \ 
1.00 0.45 
0.05 -0.50 
-0.60 0.50 
0.50 0.10 
\ 1.00 -0.15/ 



Hence, the covariance matrices Zg : 
values: 



f 1 = diag(0.1,0.1,0.1,0.1,0.1,0.1) 
f2 = diag(0.4,0.4,0.4,0.4,0.4,0.4) 
f3 = diag(0.2,0.2,0.2,0.2,0.2,0.2) 



A, 



An 



/O.IO 0.20 \ 
0.20 0.50 
1.00 -1.00 
-0.20 0.50 
1.00 0.70 

V 1.20 -0.30/ 



/ 0.10 0.20 \ 
0.20 0.00 
1.00 0.00 
-0.20 0.00 
1.00 0.00 

V 0.00 -1.30/ 



: A gA'^ + f g (g = 1 , 2, 3) have the following eigen- 



X{I.i) = (3.17, 1.63,0.10,0.10,0.10,0.10)' 
X{1.2) = (4.18,2.27,0.40,0.40,0.40,0.40)' 
^(£3) = (2.29, 1.93,0.20,0.20,0.20,0.20)', 

whose largest value is given by max,-,g A,- (Zg ) =4.18. 

First we run the unconstrained algorithm: the right solution has been attained in 24% 
of cases, without incurring in singularities. Summary statistics (minimum, first quartile Q\, 
median Qo, third quartile ^3 and maximum) about the distribution of the misclassification 
error over the 100 runs are reported in Table Due to the choice on parameters, we rarely 
expect too small eigenvalues in the estimated covariance matrices: we set a = 0.01 to protect 
from them; conversely, as local maxima are quite often due to large estimated eigenvalues, 
we consider setting also a constraint from above, taking into account some values for b, the 
upper bound. To compare how the choice of the bounds a and b influences the performance 
of the constrained EM, we experimented with different pairs of values, and in Table [3] we 
report the more interesting cases. Further results are reported in Figure [T] which provides 
the boxplots of the distribution of the misclassification errors obtained in the sequence of 
100 runs, showing the poor performance of the unconstrained algorithm compared with the 
good behaviour of its constrained version. For all values of the upper bound b, the third 
quartile of the misclassification error is steadily equal to 0. Indeed, for Z? = 6, 10 and 15 we 
had no misclassification error, while we observed very low and rare misclassification errors 
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Table 2 Mixture 1: Summary statistics of the distribution of the Misclassification Error over 100 rans of the 
unconstrained EM algorithm 

Misclassification Error 
min 6i Qi Qj max 
0% 17% 36% 45.3% 60% 



Table 3 Mixture 1 : Percentage of convergence to the right maximum of the constrained EM algorithms for 
a = 0.01 and some values of the upper constraint b 



+ 00 


6 


b 

10 


15 


20 


25 


24% 


100% 


100% 


100% 


97% 


89% 



unconstrained b=6 b=10 b=15 b=20 b=25 



Fig. 1 Mixture 1 : Boxplots of the misclassification error From left to right, the first boxplot refers to the 
unconstrained algorithm, then the following boxplots correspond to the constrained algorithm, for a = 0.01 
and b respectively set to the values b = 6, 10, 15,20,25. 



only for & = 20 and b = 25 (respectively 3 and 1 1 not null values, over 100 runs). Moreover, 
the robustness of the results with respect to the choice of the upper constraint is apparent. 

In Figure|2]we plot the classified data on the three factor spaces given by U,i , U/2 and U/3 
under the true maximum of the likelihood function (first rows of plots), while in the second 
row we give the classification obtained according to a spurious maximum of the likelihood 
function. 

We recall that an original data point x, can be represented in q dimensions by the poste- 
rior distribution of its associated g-dimensional latent factor U/. A convenient summary of 
this distribution is its mean. Hence we can portray the x/ in ^-dimensional space by plotting 
the estimated conditional expectation of each U/ given x,, that is, the (estimated) posterior 
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Fig. 2 Mixture 1: plot of the classified data on the three factor spaces, under the true maximum of the 
likelihood function (upper row) and, conversely, under a spurious maximum of the likelihood function (row 
below) 



mean of the factor U, (for / = i,. . . ,n). We have that 

u,- = Eg{U/|lx,} = 7(x;-x) 



where Eg denotes expectation using the estimate 9 instead of 9, and 7 has been computed 
foUowing ( [Tol l. 

In the particular case of ^7 = 2, as in this simulation experiment, we can draw the data in a 
bidimensional plot in Figure |2] From the two series of plots, it can be seen that the appro- 
priate factor space allows for the right classification, while a spurious likelihood maximizer 
leads to unsuitable factor spaces, which in turn generate serious issues in classification. 



Mixture 2: G = 4, d = 7, q = 2, N = 100. 

The sample has been generated with weights a = (0.2,0.3,0.35,0.15)' according to the 
following parameters: 

Ml = (0,0,0,0,0,0,0)' fi =diag(0.2,0.2, 0.2,0.2,0.2,0.2, 0.2) 

M2 = (5,5,5,5,5,5,5)' = diag(0.25, 0.25,0.25, 0.25,0.25, 0.25, 0.25) 

M3 = (10,10,10,10,10,10,10,)' f 3 =diag(0.15, 0.15,0.15, 0.15,0.15, 0.15, 0.15) 

JIX4 = (15,15,15,15,15,15,15)' ^4 = diag(0.1,0.1,0.1,0.1,0.1,0.1,0.1) 
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/ 0.30 
0.60 
0.03 
-0.36 
0.30 
0.60 

V-0.63 



0.60 
0.27 
-0.30 
0.30 
0.06 
-0.09 
1.50 / 



/ 0.08 
' 0.16 
0.80 
-0.16 
0.80 
0.96 
V 1.60 



0.16 \ 

0.40 ' 
-0.80 

0.40 

0.56 
-0.24 
-0.24/ 



A3 : 



/ 0.07 
' 0.14 
0.70 
-0.14 
0.70 
0.00 
V 0.70 



0.14 \ 

0.00 ' 

0.00 

0.00 

0.00 
-0.91 
-0.70/ 



I 0.04 
0.08 
0.40 
-0.08 
0.40 
0.00 

V-0.40 



0.08 
0.00 
0.00 
0.00 
0.00 
-0.52 
0.80 / 



The covariance matrices = AjjA'^ + fg (g = 1,2,3) have respectively the following 
eigenvalues: 

A(i:i) = (4.10, 1.14,0.33,0.21,0.15,0.09,0.04)' 
A(X2) = (7.62, 1.18,0.34,0.20,0.18,0.12,0.05)' 
A(X3) = (3.36, 1.36,0.24,0.17,0.14,0.10,0.09)' 
XiJlA) = (2.08,0.48,0.11,0.09,0.07,0.06,0.02)'. 

whose largest value is given by max,-,j A, (Zg) = 7.62. 

First we run the unconstrained algorithm: the right solution has been attained only once, 
over 100 runs. Afterwards, we run the constrained algorithm for different values of the 
upper bound b on the largest eigenvalue, while maintaining a = 0.01, and using the same 
random starting values as before, to compare how the choice of the bounds influences the 
performance of the constrained EM. In Table |4] we collected the percentage of times in 
which the algorithm attained the right maximum (where h = denotes the unconstrained 
procedure), showing a great improvement with respect to the previous 1% obtained through 
the unconstrained version. Further details are given in Figure|3]which shows the boxplots of 



Table 4 Mixture 2: Percentage of convergence to the right maximum of the constrained EM algorithms for 
a = 0.01 and different values for the upper bound b. 







b 








10 


15 


20 


25 


1% 


69% 


60% 


46% 


33% 



the distribution of the misclassification error in the 5 sequences of 100 runs, corresponding 
to the different values of the constraint b. Also in this case the unconstrained algorithm 
had a bad performance, with a median misclassification error of 0.53, while its constrained 
version, for Z? = 10 and 15, in more than 50% of the runs had no misclassification error. 
Furthermore, the unconstrained algorithm did not attain convergence in 4 out of the 100 
runs. 

Finally, in Figure |4] we plot the classified data on the factor spaces, under the true maxi- 
mum of the likelihood function, while in Figure|5]we give the classification in some wrong 
factor spaces, obtained according to a spurious maximum of the likelihood function. 

Mixture 3:G = 4, £/ = 7, ^ = 2, A' = 100. 

The third study concerns an artificial dataset analysed in lBaek et al\ ( l2010h . It has been 
generated with weights a = (0.5,0.5)' according to the following parameters: 



Ml = (0,0,0)' 



M2 = (2,2,6)' 
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Fig. 3 Mixture 2: Boxplots of the misclassification error: from left to right, the first boxplot refers to the 
unconstrained algorithm, then the following boxplots correspond to the constrained algorithm, for a = 0.01 
and b respectively set to the values b = 10, 15,20,25. 



/ 4 -1.8 -1\ / 4 1.8 .8\ 

El = -1.8 2 0.9 ^2 = 1-8 2 0.5 

V -1 0.9 2 / \0.80 0.5 2 / 

The covariance matiices (g = 1,2) have respectively the following eigenvalues: 

= (5.55,1.61,0.84)' 
1{L2) = (5.33,1.73,0.94)' 

We ran the unconstrained algorithm and its constrained version with the choices of a = 0.01 
and Z? = 6, 10, 15,20,25 as before, and also w e compare our proposal to the Mixture of Com- 
mon Factor Analyzers (MCFA) approach of iBaek and McLachlarj (l201lh . The percentages 
of convergence to the right maximum for the seven different cases are reported in Table |5] 
We recall that MCFA requires a common pattern between covariance matrices. This model 
is greatly employed in the literature, for parsimony and to avoid potential singularities with 
small clusters. Over the 100 rans, the MCFA EM algorithm did not converge in 36 cases, 



Table 5 Mixture 3: Percentage of convergence to the right maximum of the unconstrained EM, the con- 
strained EM algorithm and the MCFA EM algorithm 



imcons trained 




constrained 


MCFA 
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b=m b = \5 b = 20 


b = 25 
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100% 


96% 96% 97% 


97% 36% 
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Fig. 4 Mixture 2: plot of the classified data on the factor spaces, under the "right" solution given by the 
algorithm 



while it always reached convergence in the other cases. With respect to the performance of 
the different algorithms in terms of misclassification error, the corresponding boxplots are 
shown in Figure |6] We also note that the misclassification error was steadily equal to 1% 
over the 100 runs for the constrained algorithm with Z? = 6, it was always equal to 1% except 
5 runs for the unconstrained algorithm, while in the case of MCFA we have Qi = Me = 1%, 
but 23 = 34.5% and Max = 50%. All these results show that, to attain good performance 
and robustness in estimation, our proposal works quite better. Furthermore, it allows for a 
more general solution in comparison to the rigid requirement of a common pattern between 
covariance matrices. As a consequence, also the log-likelihood of the model obtained by our 
constrained algorithm (Jf = —1032.218) is fairly greater than the log-likelihood obtained 
in MCFA model = -1 147.396). 
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Fig. 5 Mixture 2: plot of the classified data on the factor spaces, giving an example of the wrong classifica- 
tion, which is obtained when the algorithm converges to a spurious maximum of the loghkelihood 



6.2 Real data 



The Wine data set 



Now we consider the wine data, proposed in iForina et al\ lll986h . consisting of d = 21 
chemical and physical properties of three different cultivars of Italian wine: Barolo, Grigno- 
lino and Barbera. This dataset is often use d to test and compare the perform ance of various 
classification algorithms: among the m, in 



McNicholas and Mur 



using parsimo- 
nious Gaussian mixture models and in lAndrews and McNicholasI ( 120 llh using parsimonious 
mixtures of multivariate r-factor analyzers. 



Consider first the complete dataset, with d = 27. We run the EM algorithm starting from 
the true classification, and using the maximum likelihood estimate Q we get 3 misclassified 
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unconstrained 



b=25 MCFA 



Fig. 6 Mixture 3: Boxplots of the misclassification error From left to right, the first boxplot refers to the 
unconstrained algorithm, then the following boxplots correspond to the constrained algorithm, for a = 0.01 
and b respectively set to the values b = 6, 10, 15,20,25, and finally to the MCFA algorithm. 

units (i.e. Misclassification Eiror 1.69%). Based on estimates of A ^ and "Pg, we get 

A,nax(Ai)= 28513 KMM) =6345 A„ax(A3) = 9045 

An,ax('f'i) = 27830 A„ax('f'2)= 22532 K.A'i'i) = 21573. 

With the aim at comparing our results with the above findings in the literature, we first scaled 
the original data, and applied the Pgmm package (jMcNicholas et al., 20JJj). Using a set of 
three random starts, the best model (BIC) for the given range of factors and components 
(from 1 up to 4) is a CUU model with q = 4 and G = 3. The CUU acronym stands for a 
MGFA with patterned covariance matrices, with a common (C) volume and unconstrained 
(U) shapes and orientations among the G=3 components in the mixture. Factors for the best 
model are of dimension q=4, with BIC= -11427.65. The obtained classification is given by 
Table[6l showing only 2 misclassified units. 



Table 6 Pgmm package applied on the Wine dataset 



Classification table 





1 


2 


3 


1 


59 








2 


1 


69 


1 


3 








48 



Then we employed our approach, after scaling the data and using hierarchical cluster- 
ing for initialization (as in the previously cited work). We obtained 5 misclassified units 
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(which means a misclassification error of 2.8%). If we initialize the EM algorithm with 
the true belonging of units and considering still 4-dimensional factors, we obtain a perfect 
classification. We also obtain a better fit of the model to the data, assessed by a greater 
penalized likelihood value, namely BIC= -10814.68, due to the lighter constraints we are 
imposing h ere. Finally, we employed a mix ture of /-factor analyzers, applying the teigen 
R-package ( [Andrews and McNicholasL l2012h . on the scaled data. We considered patterned 
models, whose label is a sequence of four letters: each letter can be "C" or "U" or "I" 
denoting "Constrained to be equal", "Unconstrained" and "Isotropic" patterns on group co- 
variances, and the four letters in the model label are respectively referred to volume, shape, 
orientation, and the degrees of freedom of the f-distribution. We got that the best fit (BIC 
=-11939.94) is given by CICC model with G=5, and this is somehow surprising as we al- 
ways obtained 3 groups, by all the methods seen so far, in particular also in the proposed 
constrained EM approach for gaussian factors. 

The Flea Beetles data set 

The flea beetles dat a were introduced by iLubischewl (Il962h and are available within 
the GGobi software, see ISwavne et all ( l2006h . Data were collected on 74 specimens of flea 
beetle of the genus Chaetocnema, which contains three species: concinna, heptapotamica, 
or heikertingeri. Measurements were collected on the width (in the fore-part and from the 
side) and angle of the aedeagus, on the width of the first and second joint of the tarsus, and 
on the width of the head between the external edges of the eyes of each beetle. 

The goal of the original study was to form a classification rale to dis tinguish the three 

specie s. To this aim, we considered q = 2 factors, according to the results of I Andrews and McNicholaa 
( l201lh . and we ran firstly the unconstrained algorithms. Over the 100 rans, the uncon- 
strained EM algorithm never reached the trae solution, and summary statistics (minimum, 
first quartile Q\, median Q2, third quartile Qt, and maximum) about the distribution of the 
misclassification error over the 100 rans are reported in Tablel?] The first results motivated 

Table 7 Flea Beetles data: Summary statistics of the distribution of the Misclassification Error over 100 rans 
of the unconstrained EM algorithm 

Misclassification Error 

min Qi Q2 63 max 

4.1% 28.0% 36.5% 41.9% 51.4% 



US to ran also the constrained EM algorithm, to see if it improves convergence to the right 
maximum and consequent classification. Tacking into account that 

minA,(i:^) = 0.64 maxA,(i:g) = 191.55, 

we considered constrained estimation with 

lower bound a either 0. 1 or 0.5, and upper bound b either 200 or 300. 

Over the 100 rans, the constrained algorithm steadily improves all results, as it can be 
seen in Table [8l which shows also that the best results can be obtained with the tightest 
constraints, i.e. a = 0.05, = 200. 
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Table 8 Flea Beetles data: Percentage of convergence to the right maximum of the unconstrained EM and 
the constrained EM algorithm 



unconstrained 
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unconst a=0.1,b=200 a=0.5, b=200 a=0.1,b=300 a=0.5,b=300 



Fig. 7 Flea Beetles data: Boxplots of the misclassification error. From left to right, the first boxplot refers to 
the unconstrained algorithm, then the following boxplots correspond to the constrained algorithm, for each 
pair of bounds (a, fo) 



Figure|7]provides the boxplots of the distribution of the of the 100 misclassification er- 
rors in the sequences of 100 runs for both unconstrained and constrained algorithms. The 
impact of the lower bound a on the estimation is critical, but it seems not to depend too 
much on its value (remember that its purpose is to protect against divergence of the algo- 
rithm) while the upper bound b crucially drives the classification results, showing the best 
performance when it mimics the value of the largest eigenvalue of the ^g's. As a final com- 
ment, it is worth mentioning that, when dealing with EM estimation based on random starts, 
authors in the literature usually give results in terms of "best outcome over a small number 
of runs", say 10 runs for instance. Therefore, we can conclude that the constrained algorithm 
(having a performance of 31%) provides the true solution and the perfect classification for 
the Flea Bleetles dataset. 



7 Concluding remarks 



Mixtures of factor analyzers are commonly used to explain data, in particular, correlation 
between variables in multivariate observations, allowing also for dimensionality reduction. 
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For these models, as well as for gaussian mixtures, however, the loglikelihood function may 
present spurious maxima and singularities and this is due to specific patterns of the estimated 
covariance structure. It is known, from the literature, that a constrained formulation of the 
EM algorithm considerably reduces such drawbacks for gaussian mixtures. Motivated by 
these considerations, in this paper we introduced a constrained approach for gaussian mix- 
tures of factor analyzers. In particular we implemented a methodology to maximize the 
likelihood function in a constrained parameter space, having no singularities and a reduced 
number of spurious local maxima. The performance of the newly introduced estimation ap- 
proach has been shown and compared to the usual non-constrained one, as well as to the 
approach based on common factors. To this purpose we present numerical simulations on 
synthetic samples and applications to real data sets widely employed in the literature. The 
results shows that the problematic convergence of the EM, even more critical when dealing 
with factor analyzers, can be greatly improved. 
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